Improve predictability of DataQuery, DataID, and dependency tree #3018

Draft
wants to merge 22 commits into
base: main
from

Conversation

djhoese
Member

@djhoese djhoese commented Dec 13, 2024

Disclaimer: This PR's implementation is very rough at the time of writing.

This PR includes major changes to key parts of Satpy in order to resolve some inconsistencies noticed by users over the years. The high-level concepts that have been changed or updated are:

  1. DataQuery objects are now only equal to DataID objects that match all of the query's keys. Previously only the shared keys were compared. The old behavior meant that a DataQuery could match a DataID that didn't contain all of the information requested by the query.
  2. The "resolution" DataID key was not consistently transitive across ID key sets: for the "default" ID keys it was False, while for the coordinate and minimal sets it was True. It made the most sense to set it to False everywhere. That is, a modified dataset is not required to have all of its dependencies be at the same resolution.
  3. Add and refactor a lot of tests regarding DataQuery and DataID comparisons.
  4. It should be possible to load a composite with two different sets of inputs (ex. DataQuery(name="comp", resolution=500), DataQuery(name="comp", resolution=1000)).
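
To make change 1 concrete, here is a minimal sketch that uses plain dictionaries as stand-ins for DataQuery and DataID. The helper names `matches_shared_keys` and `matches_all_query_keys` are hypothetical and are not Satpy functions; wildcards and list values are left out for brevity.

```python
def matches_shared_keys(query: dict, data_id: dict) -> bool:
    """Old behavior (as described above): compare only keys present in both."""
    shared = query.keys() & data_id.keys()
    return all(query[key] == data_id[key] for key in shared)


def matches_all_query_keys(query: dict, data_id: dict) -> bool:
    """New behavior: every query key must exist in the ID and be equal."""
    return all(key in data_id and query[key] == data_id[key] for key in query)


query = {"name": "comp", "resolution": 500}
id_without_resolution = {"name": "comp"}
id_with_resolution = {"name": "comp", "resolution": 500}

# The ID lacks "resolution": the old rule still matches, the new rule does not.
old_result = matches_shared_keys(query, id_without_resolution)
new_result = matches_all_query_keys(query, id_without_resolution)
full_result = matches_all_query_keys(query, id_with_resolution)
```

Under the stricter rule, an ID that is missing a requested key no longer counts as a match, which is also the behavior questioned in "Hindsight" below.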

Remaining work

  • There are a lot of edge cases that need to be worked out. The biggest one is what happens when a DataQuery has a list of possible options. That is not handled in a lot of my dependency tree stuff.
  • See "Hindsight" below.
  • Refactoring
  • More explicit tests

Hindsight

For high-level change 1 above, I'm starting to think this was the wrong change or at least that the previous behavior had a good point. That is, if a user creates a query with a lot of keys to apply to many DataIDs from different sources, then not all DataIDs should be required to have all those keys. For example, if I specify a polarization in my query, then I don't think all composites or rather composite dependencies would be able to match that. There are currently no tests to verify this.

  • Closes #xxxx
  • Tests added
  • Fully documented
  • Add your name to AUTHORS.md if not there already

@djhoese djhoese added bug component:compositors component:dep_tree Dependency tree and dataset loading labels Dec 13, 2024

codecov bot commented Dec 13, 2024

Codecov Report

Attention: Patch coverage is 96.96312% with 14 lines in your changes missing coverage. Please review.

Project coverage is 96.08%. Comparing base (5984c29) to head (c45ed8d).
Report is 3 commits behind head on main.

Files with missing lines               Patch %   Lines
satpy/dataset/id_keys.py               96.15%    5 Missing ⚠️
satpy/dependency_tree.py               83.33%    5 Missing ⚠️
satpy/dataset/dataid.py                97.50%    2 Missing ⚠️
satpy/tests/test_dependency_tree.py    98.19%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3018      +/-   ##
==========================================
- Coverage   96.10%   96.08%   -0.02%     
==========================================
  Files         377      378       +1     
  Lines       55163    55213      +50     
==========================================
+ Hits        53012    53050      +38     
- Misses       2151     2163      +12     
Flag              Coverage Δ
behaviourtests    3.98% <24.72%> (+0.04%) ⬆️
unittests         96.17% <96.96%> (-0.02%) ⬇️


@coveralls

coveralls commented Dec 14, 2024

Pull Request Test Coverage Report for Build 12381834226

Details

  • 447 of 461 (96.96%) changed or added relevant lines in 25 files are covered.
  • 9 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.02%) to 96.188%

Changes Missing Coverage               Covered Lines   Changed/Added Lines   %
satpy/dataset/dataid.py                78              80                    97.5%
satpy/tests/test_dependency_tree.py    109             111                   98.2%
satpy/dataset/id_keys.py               125             130                   96.15%
satpy/dependency_tree.py               25              30                    83.33%

Files with Coverage Reduction                                 New Missed Lines   %
satpy/dependency_tree.py                                      1                  95.77%
satpy/tests/utils.py                                          2                  93.16%
satpy/tests/reader_tests/gms/test_gms5_vissr_l1b.py           3                  98.67%
satpy/tests/reader_tests/gms/test_gms5_vissr_navigation.py    3                  97.18%

Totals Coverage Status
Change from base Build 12299617024: -0.02%
Covered Lines: 53294
Relevant Lines: 55406


@djhoese djhoese force-pushed the bugfix-greedy-dataid branch from 46eafc6 to de36fb8 Compare December 17, 2024 21:20
new_id_dict = orig_id.to_dict()
orig_id_keys = orig_id.id_keys
for query_key, query_val in query_dict.items():
    # XXX: What if the query_val is a list?
Member Author

This is my main task remaining. I'm really not sure how I want to treat this. If you ask for a composite as DataQuery(name="some_comp", resolution=[500, 1000]), what do you set in the DataID? This is before we know what dependencies were found. So does the composite DataID become DataID(name="some_comp") with no resolution, resolution 500, or resolution 1000?

Member

imo it should get the final resolution. So if the generated composite has a 500m resolution, the dataid should carry that.
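
One way to read this suggestion is to leave list-valued keys open until the composite exists and then fill them in from the generated data. The sketch below uses plain dictionaries; `finalize_composite_id` and the `composite_attrs` argument are hypothetical names for illustration, not part of Satpy or of this PR.

```python
from typing import Any


def finalize_composite_id(query: dict[str, Any], composite_attrs: dict[str, Any]) -> dict[str, Any]:
    """Build the final ID dict once the composite has been generated."""
    final_id: dict[str, Any] = {}
    for key, requested in query.items():
        if not isinstance(requested, list):
            # A single requested value is carried over unchanged.
            final_id[key] = requested
            continue
        actual = composite_attrs.get(key)
        if actual in requested:
            # The generated composite picked one of the allowed options.
            final_id[key] = actual
        # Otherwise the key stays unset rather than guessing a value.
    return final_id


query = {"name": "some_comp", "resolution": [500, 1000]}
final_id = finalize_composite_id(query, {"resolution": 500})
unresolved_id = finalize_composite_id(query, {})
```

This still leaves the question of what the ID should be before the dependencies are found; the sketch only covers the step after generation.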

Member

@mraspaud mraspaud left a comment

Ok, I made a first pass. Overall, thanks a lot for the clarifications and refactorings, the code reads better now.
I have comments inline, or rather mostly questions because I’m not following everything :)

"""
return self.equal(other, shared_keys=False)

def equal(self, other: Any, shared_keys: bool = False) -> bool:
Member

I’m not fond of passing a bool here. Could it be passed something like keys_to_match instead, i.e. the list of keys to use for the matching?
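
A rough sketch of what a `keys_to_match` parameter could look like, using plain dictionaries in place of DataQuery and DataID. The function below is hypothetical and only illustrates the idea; it is not the signature used in the PR.

```python
from typing import Any, Iterable, Optional


def equal(query: dict[str, Any], other: dict[str, Any],
          keys_to_match: Optional[Iterable[str]] = None) -> bool:
    """Compare on an explicit set of keys; default to all of the query's keys."""
    keys = list(query) if keys_to_match is None else list(keys_to_match)
    return all(key in query and key in other and query[key] == other[key] for key in keys)


query = {"name": "comp", "resolution": 500}
data_id = {"name": "comp"}

# Default: every query key must match, so the missing resolution fails.
strict = equal(query, data_id)
# The caller can reproduce the "shared keys" behavior explicitly.
shared_only = equal(query, data_id, keys_to_match=query.keys() & data_id.keys())
# Or restrict the comparison to specific keys.
name_only = equal(query, data_id, keys_to_match=["name"])
```

Compared with a boolean flag, the caller states exactly which keys take part in the comparison, so both the strict and the shared-key behavior become special cases of one parameter.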

satpy/dataset/dataid.py
def _create_id_dict_from_any_key(dataset_key):
try:
def _create_id_dict_from_any_key(dataset_key: DataQuery | DataID | str | numbers.Number) -> dict[str, Any]:
if hasattr(dataset_key, "to_dict"):
Member

"Easier to ask for forgiveness than permission"?

Member Author

If I remember correctly, I changed this because mypy was getting mad about the typing and the try/except being hard to parse. It might have also been CodeScene. I knew you would prefer try/except here, but the linters didn't like it so I thought it was OK. I can do some type ignoring if you'd still prefer the try/except.

Member

@mraspaud mraspaud Jan 15, 2025

that’s fine…
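
For readers unfamiliar with the two styles discussed in this thread, here is a small side-by-side sketch. `FakeQuery` is a hypothetical stand-in for any object exposing `to_dict`, and the fallback mapping (a string becomes a name, a number becomes a wavelength) is an assumption made for illustration rather than a quote of the Satpy code.

```python
import numbers
from typing import Any


class FakeQuery:
    """Hypothetical stand-in for an object that exposes ``to_dict``."""

    def __init__(self, **keys: Any) -> None:
        self._keys = keys

    def to_dict(self) -> dict[str, Any]:
        return dict(self._keys)


def _id_dict_from_plain_value(dataset_key: Any) -> dict[str, Any]:
    # Assumed fallback rules, for illustration only.
    if isinstance(dataset_key, str):
        return {"name": dataset_key}
    if isinstance(dataset_key, numbers.Number):
        return {"wavelength": dataset_key}
    raise TypeError(f"Don't know how to interpret a dataset_key of type {type(dataset_key)}")


def id_dict_lbyl(dataset_key: Any) -> dict[str, Any]:
    """'Look before you leap': check for the method first."""
    if hasattr(dataset_key, "to_dict"):
        return dataset_key.to_dict()
    return _id_dict_from_plain_value(dataset_key)


def id_dict_eafp(dataset_key: Any) -> dict[str, Any]:
    """'Easier to ask for forgiveness than permission': try, then fall back."""
    try:
        return dataset_key.to_dict()
    except AttributeError:
        return _id_dict_from_plain_value(dataset_key)


query = FakeQuery(name="comp", resolution=500)
```

Both variants return the same dictionaries for these inputs. One difference is that the try/except form also swallows an AttributeError raised inside `to_dict` itself.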

return new_id


def _keys_to_compare(sdict: dict, odict: dict, o_is_id: bool, shared_keys: bool) -> set:
Member

what does o_is_id stand for?

Member Author

It stands for "other is a DataID", or at least looks like one. I could give the variable a longer name for sure. I went back and forth on how to handle the different cases, but at the end of the day my sanity was preserved by doing simple if/else statements on the DataID-ness of "other". Now that most of the logic is stabilizing I could revisit this. Ideas welcome.

Comment on lines +532 to +565
def _compare_key_equality(sdict: dict, odict: dict, key: str, o_is_id: bool) -> bool:
    if key not in sdict:
        return False
    sval = sdict[key]
    if sval == "*":
        return True

    if key not in odict:
        return False
    oval = odict[key]
    if oval == "*":
        # Gotcha: if a DataID contains a "*" this could cause
        # unexpected matches. A DataID is not expected to use "*"
        return True

    return _compare_values(sval, oval, o_is_id)


def _compare_values(sval: Any, oval: Any, o_is_id: bool) -> bool:
    if isinstance(sval, list) or isinstance(oval, list):
        # multiple options to match
        if not isinstance(sval, list):
            # query to query comparison, make a list to iterate over
            sval = [sval]
        if o_is_id:
            return oval in sval

        # we're matching against a DataQuery who could have its own list
        if not isinstance(oval, list):
            oval = [oval]
        s_in_o = any(_sval in oval for _sval in sval)
        o_in_s = any(_oval in sval for _oval in oval)
        return s_in_o or o_in_s
    return oval == sval
Member

It’s probably my doing, but could we take the opportunity to find better/longer names for sval, oval, sdict, odict, etc?
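
As a starting point for the renaming, here is a sketch of the two helpers quoted above with longer names. The new names (`query_dict`, `other_dict`, `query_value`, `other_value`, `other_is_id`) are only suggestions, and the logic is intended to behave the same as the quoted code.

```python
from typing import Any


def compare_key_equality(query_dict: dict, other_dict: dict, key: str, other_is_id: bool) -> bool:
    if key not in query_dict:
        return False
    query_value = query_dict[key]
    if query_value == "*":
        # A wildcard in the query matches anything, including a missing key.
        return True
    if key not in other_dict:
        return False
    other_value = other_dict[key]
    if other_value == "*":
        # A DataID is not expected to use "*", but it is tolerated here.
        return True
    return compare_values(query_value, other_value, other_is_id)


def compare_values(query_value: Any, other_value: Any, other_is_id: bool) -> bool:
    if not isinstance(query_value, list) and not isinstance(other_value, list):
        return other_value == query_value
    query_options = query_value if isinstance(query_value, list) else [query_value]
    if other_is_id:
        # A DataID holds a single value that must be one of the query's options.
        return other_value in query_options
    # Query-to-query comparison: match if the two option lists overlap.
    other_options = other_value if isinstance(other_value, list) else [other_value]
    return any(option in other_options for option in query_options)


id_match = compare_key_equality({"resolution": [250, 500]}, {"resolution": 500}, "resolution", True)
wildcard_match = compare_key_equality({"resolution": "*"}, {}, "resolution", True)
missing_key = compare_key_equality({"resolution": 500}, {}, "resolution", True)
overlap = compare_values([250, 500], [500, 750], False)
no_overlap = compare_values([250], [500, 750], False)
```

The quoted code computes both `s_in_o` and `o_in_s`; since each is true exactly when the two lists share an element, a single overlap check is enough in this sketch.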

@@ -424,6 +430,7 @@ def _find_compositor(self, dataset_key, query):
# one or more modifications if it has modifiers see if we can find
# the unmodified version first

orig_query = dataset_key
Member

  1. are we sure this is a query, not an id?
  2. should we rename the function argument instead?

Member Author

I'm honestly not sure, but I think so. This is reminding me that I wanted to add more type annotations to this stuff for this reason, but forgot after needing some last minute changes in other parts of the code. If I remember correctly the information passed by the user gets turned into a DataQuery in one of the top-level tree methods and gets passed around from there.

Comment on lines +507 to +514
if key.get("name", default="*") != "*" and len(key.to_dict()) == 1:
    # the query key is just the name and still couldn't be found
    raise KeyError("Could not find compositor '{}'".format(key))

# Get the generic version of the compositor (by name only)
# then save our new version under the new name

return self._get_compositor_by_name(key)
Member

What use cases is this covering?

Member Author

Ah the name of this method is incorrect after more refactoring was done. So the code above this is when the compositor exactly matches what the user asked for. This is usually the common case of just a compositor by name but could also include other filtering parameters like resolution or calibration or whatever else was asked for.

The method _get_compositor_by_name first gets the base compositor by just the name and then if more than one compositor matches it filters down using the other properties of the provided DataQuery. The cases where a compositor name returns more than one compositor would be if there are varying resolutions or calibrations for a single compositor. So again, not a common use case, but it was in the tests already.
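
The lookup described above can be sketched roughly as follows. Compositors are modeled as plain dictionaries, and `get_compositor_by_name` here is a simplified, hypothetical stand-in for the private method, not its actual implementation.

```python
from typing import Any


def get_compositor_by_name(compositors: list[dict[str, Any]], query: dict[str, Any]) -> dict[str, Any]:
    """Find a compositor by name, then narrow down with the other query keys."""
    name = query["name"]
    candidates = [comp for comp in compositors if comp["name"] == name]
    if not candidates:
        raise KeyError(f"Could not find compositor '{name}'")
    if len(candidates) == 1:
        # Common case: the name alone identifies the (generic) compositor.
        return candidates[0]
    # Several compositors share the name, e.g. at different resolutions.
    extra_keys = [key for key in query if key != "name"]
    filtered = [comp for comp in candidates
                if all(comp.get(key) == query[key] for key in extra_keys)]
    if len(filtered) != 1:
        raise KeyError(f"Could not find a unique compositor for {query}")
    return filtered[0]


compositors = [
    {"name": "comp", "resolution": 500},
    {"name": "comp", "resolution": 1000},
    {"name": "other"},
]
picked = get_compositor_by_name(compositors, {"name": "comp", "resolution": 1000})
generic = get_compositor_by_name(compositors, {"name": "other", "resolution": 250})
```

In this sketch a query with extra keys still resolves to the generic compositor when only one compositor has that name, which mirrors the "generic version" comment in the quoted code.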

Comment on lines +671 to +673
(DataQuery(name="1", resolution=[250, 500]), DataQuery(name="1", resolution=[500, 750])), # opposite order
(DataQuery(name="1", resolution=500), DataQuery(name="1", resolution=[500, 750])), # opposite order
(DataQuery(name="1", resolution=[250, 500]), DataQuery(name="1", resolution=500)), # opposite order
Member

These look surprising?

Member Author

This is a case of the code written first then the tests second. More importantly, this is testing code that I didn't write, but I could tell this functionality was intended. That is, I believe we support users specifying a list of things in their DataQuery (and in the filter kwargs passed to Scene.load) and the results should/could match any of the things in the list.

Or is there something else surprising about these? I'm mostly making sure that, regardless of which DataQuery object's __eq__ is called (the first or the second), things are still matched. I'm not sure if the # opposite order comment at the end of these 3 lines is valid. Most likely it is an artifact of copy/pasting the earlier test. Or maybe the comment doesn't apply to the third-to-last case, but does for the last 2 cases.

(DataQuery(name="1", resolution="*"), dict(name="1")),
(DataQuery(name="1", resolution="*"), dict(name="1", resolution=500)),
# DataID shouldn't use * but we still test it:
(DataQuery(name="1", resolution=500), dict(name="1", resolution="*")),
Member

is * valid for dataids?

Member Author

It is technically allowed since it is a string and DataID does not do any validation against it. That's why the comment is there. There is also a comment somewhere in the equality method of the DataQuery about it. There is nothing stopping it in the code from happening (a "*" in a DataID), but it also isn't expected or wanted or supported entirely.

Labels
bug component:compositors component:dep_tree Dependency tree and dataset loading
3 participants